Welcome! The purpose of this project is to produce an algorithm that can help the online real-estate website Zillow.com predict home sale prices with greater accuracy. Improving real-estate valuations is important for several reasons.
The gains from such an algorithm are worthwhile; however, this project proved to be a challenging exercise. Namely, it is incredibly difficult to quantify the features that influence home values without leaning into historical disparities (homes in 'good' versus 'bad' areas are often paired with disparities of race, class, and amenities). Likewise, the presence of collinear variables may affect the performance of our model.
We employed the hedonic model (HM) for this project. Simply put, HM refers to a regression model that estimates the influence various factors have on the price of a good (here, a home) and is commonly used in real-estate pricing. In this case, home price served as our dependent variable, while various features served as our independent variables. We identified a collection of physical and demographic features we believed influence the price of a home, cleaned these datasets, and tested their significance for home valuation. We then checked our results through cross-validation and employed statistical metrics such as mean absolute error (MAE), mean absolute percentage error (MAPE), and Moran's I.
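As a quick illustration of the hedonic idea, here is a minimal sketch on fabricated data (none of these numbers come from our Charlotte dataset): regress price on a few characteristics and read the coefficients as the implicit price of each characteristic.

```r
# Toy hedonic regression on fabricated data (illustration only)
set.seed(1)
toy <- data.frame(
  sqft  = runif(200, 800, 4000),
  baths = sample(1:4, 200, replace = TRUE),
  age   = sample(0:80, 200, replace = TRUE))
# Simulate sale prices with known "true" implicit prices plus noise
toy$price <- 50000 + 150 * toy$sqft + 20000 * toy$baths - 500 * toy$age +
  rnorm(200, sd = 25000)
# The hedonic regression recovers the implicit price of each characteristic
hedonic <- lm(price ~ sqft + baths + age, data = toy)
coef(hedonic)
```

In our actual model the characteristics are the internal, amenity, and spatial structure variables described below, and the coefficients are estimated from observed Charlotte sales.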
Ultimately, we produced a functional model with one major weakness: it does not generalize across majority-white and majority-non-white neighborhoods. Specifically, we saw a higher MAPE in majority-non-white neighborhoods as well as lower average home valuations. We review the reasons for this later.
Nonetheless, this project was a great exercise. We look forward to you reviewing our model and code!
To gather data, our team focused on Mecklenburg County, NC (Charlotte Metro Area) and sourced information from the county's open data website, as well as the American Community Survey and U.S. Census.
To begin, we will import a home sales dataset that includes variables like location, housing characteristics, and home quality for the Charlotte Metro Area. We will then 'clean' our data by creating useful columns: "building grade" as a numeric value (quality, where higher values correspond to greater quality), "age of home" (age), "price per square foot" (pricesqft), and "total baths" (totbaths), calculated by combining full- and half-bathroom counts. Moving forward, we'll refer to this home sales data as "internal variables."
#Import sales data, remove columns we won't need, add useful columns, set CRS for North Carolina
library(dplyr)
library(sf)
# Packages used later in this analysis
library(tidycensus); library(caret); library(spdep)
library(ggplot2); library(tidyr); library(stargazer)
library(ggcorrplot); library(knitr); library(kableExtra); library(gridExtra)
CLT_internal <-
  st_read("https://github.com/mafichman/MUSA_508_Lab/raw/main/Midterm/data/2022/studentData.geojson") %>%
  st_transform('ESRI:103500')
CLT_internal <- CLT_internal[c(5,9,20,21,26,28,30:46,57:60,67,68,70,71,72)] %>%
  mutate(
    age       = 2022 - yearbuilt,
    sqft      = totalac * 43560,            # acres to square feet
    pricesqft = price / (totalac * 43560),  # price per square foot (Inf where lot size is 0)
    totbaths  = (halfbaths * 0.5) + fullbaths)
# The raw data records the grade "VERY GOOD" with a space
CLT_internal$quality <- recode(CLT_internal$bldggrade,
  "MINIMUM" = 1, "FAIR" = 2, "AVERAGE" = 3, "GOOD" = 4,
  "VERY GOOD" = 5, "EXCELLENT" = 6, "CUSTOM" = 7)
To build a strong algorithm (model), it’s important to include variables that relate to the housing market such as local schools, grocery stores, and parks. We’ll refer to these variables as “amenities.”
# Adding school data
CLT_schools <-
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Schools.geojson") %>%
  st_transform(st_crs(CLT_internal))
# Adding grocery store data (local file)
CLT_grocery <-
  st_read("Grocery_pts.geojson") %>%
  st_transform(st_crs(CLT_internal))
# Adding parks data
CLT_parks <-
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Parks.geojson") %>%
  st_transform(st_crs(CLT_internal))
Finally, we will add variables that provide demographic and environmental data for the Charlotte Metro Area. Specifically, we will include educational attainment, household income, and neighborhood data. We'll refer to these variables as "spatial structure."
# Adding demographic data (get_acs requires a Census API key for tidycensus)
CLT_demo <-
  get_acs(geography = "tract",
          variables = c("B19013_001E", "B15003_022E", "B15003_001E"),
          year = 2020, state = 37, county = 119,
          geometry = TRUE, output = "wide") %>%
  st_transform(st_crs(CLT_internal)) %>%
  dplyr::select(-NAME, -B19013_001M, -B15003_022M, -B15003_001M)
CLT_demo <-
  CLT_demo %>%
  rename(HH_inc = B19013_001E,
         College = B15003_022E,
         College_age_pop = B15003_001E) %>%
  mutate(college_perc = College / College_age_pop) %>%
  dplyr::select(-College, -College_age_pop)
# Adding neighborhood data (school attendance zones)
CLT_neighborhoods <-
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/School_districts.geojson") %>%
  st_transform(st_crs(CLT_internal))
So far, we have added internal, amenity, and spatial structure variables. However, in order to build our model and analyze how these variables relate to home sales, we must transform them. We'll achieve this using two techniques:
1. K-nearest neighbor (KNN): this computes the distance between a given home and its nearest amenities (school, grocery store, park).
2. Spatial join (SJ): this joins our spatial structure data (educational attainment, neighborhoods) to our internal variables (Charlotte home sales).
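The KNN step relies on the nn_function helper from the MUSA 508 course materials; a base-R sketch of the same idea (with made-up coordinates, not our Charlotte data) shows what the feature measures: the mean distance from each home to its k nearest amenities.

```r
# Minimal k-nearest-neighbor distance sketch in base R (fabricated coordinates).
# The real analysis uses nn_function on projected, meter-based coordinates.
knn_dist <- function(from, to, k = 1) {
  # from, to: two-column matrices of x/y coordinates
  apply(from, 1, function(pt) {
    d <- sqrt((to[, 1] - pt[1])^2 + (to[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])  # mean distance to the k nearest points
  })
}
homes     <- cbind(c(0, 10), c(0, 0))    # two homes on the x-axis
amenities <- cbind(c(1, 4, 20), c(0, 0)) # three amenities
knn_dist(homes, amenities, k = 1)  # nearest-amenity distance for each home
```

With k = 1 this is simply the distance to the single nearest amenity, which is how school_nn1, grocery_nn1, and park_nn1 are built below.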
# Distance to the nearest school, grocery store, and park
# (nn_function comes from the MUSA 508 course helper functions)
CLT_internal <-
  CLT_internal %>%
  mutate(
    school_nn1  = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_schools)), k = 1),
    grocery_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_grocery)), k = 1),
    park_nn1    = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_parks)), k = 1))
# Spatial join: attach tract demographics and neighborhoods to each home
CLT_internal <-
  st_join(CLT_internal, CLT_demo)
CLT_internal <-
  st_join(CLT_internal, CLT_neighborhoods)
Below are summary statistics tables for each variable category (internal, amenities, spatial structure).
#Internal variables
ss_internal <- st_drop_geometry(CLT_internal) %>%
  dplyr::select(sqft, pricesqft, totbaths, yearbuilt)
stargazer(as.data.frame(ss_internal), type = "text", digits = 1,
          title = "Descriptive Statistics for Charlotte Metro Area Homes Internal Variables")
##
## Descriptive Statistics for Charlotte Metro Area Homes Internal Variables
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## sqft 46,183 77,176.9 4,648,606.0 0.0 425,034,086.0
## pricesqft 46,138 Inf.0 0.0 Inf.0
## totbaths 46,183 2.6 0.9 0.0 11.0
## yearbuilt 46,183 1,993.4 31.8 0 2,022
## -------------------------------------------------------
#Amenities
ss_amenities <- st_drop_geometry(CLT_internal) %>%
  dplyr::select(school_nn1, grocery_nn1, park_nn1)
stargazer(as.data.frame(ss_amenities), type = "text", digits = 1,
          title = "Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables")
##
## Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables
## ================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------
## school_nn1 46,183 3,266.8 1,390.4 8.2 8,369.4
## grocery_nn1 46,183 1,737.8 1,274.4 37.0 8,599.7
## park_nn1 46,183 1,239.8 767.7 21.0 5,026.3
## ------------------------------------------------
# Spatial Structure
ss_spatial <- st_drop_geometry(CLT_internal) %>%
  dplyr::select(HH_inc, college_perc, FID)
stargazer(as.data.frame(ss_spatial), type = "text", digits = 1,
          title = "Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables")
##
## Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables
## ====================================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------------
## HH_inc 46,180 86,175.7 37,415.7 17,685 238,750
## college_perc 46,180 0.3 0.1 0.03 0.6
## FID 46,183 26.0 10.6 0 43
## ----------------------------------------------------
Below is a matrix visualizing the correlation between our variables. Home price maintains a positive correlation with several of them; we can use this matrix to inform the variable selection for our model.
numericVars <- select_if(st_drop_geometry(CLT_internal), is.numeric) %>% na.omit()
ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("deepskyblue", "grey100", "firebrick1"),
  type = "lower",
  insig = "blank",
  tl.cex = 10, tl.col = "black") +  # axis-label settings belong to ggcorrplot(), not labs()
  labs(title = "Correlation Matrix of Numeric Variables") +
  theme(axis.text.x = element_text(size = 10)) +
  plotTheme()
Below are four home price scatterplots based upon the results of our correlation matrix:
st_drop_geometry(CLT_internal) %>%
  dplyr::select(price, quality, heatedarea, HH_inc, yearbuilt) %>%
  filter(price < 10000000) %>%
  gather(Variable, Value, -price) %>%
  ggplot(aes(Value, price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, colour = "hotpink") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Price as a function of Internal and Spatial Variables") +
  theme(axis.text.x = element_text(size = 7)) +
  plotTheme()
Below are four maps:
1. Our dependent variable, home price
2. Park locations vs. home price
3. Grocery store locations vs. home price
4. School quality (Niche.com ratings)
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey40") +
geom_sf(data = CLT_internal, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(CLT_internal,"price"),
name="Quintile\nBreaks") +
labs(title="Home Price", subtitle="Charlotte Metro Area") +
labs(color = "Observed Sales Price (quintiles)") +
theme(plot.title=element_text(size=12, face='bold'),
legend.key.size = unit(1.5, 'cm'), #change legend key size
legend.key.height = unit(1, 'cm'), #change legend key height
legend.key.width = unit(1, 'cm'), #change legend key width
legend.title = element_text(size=14), #change legend title font size
legend.text = element_text(size=10)) + #change legend text font size
mapTheme(),
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey70") +
geom_sf(data = CLT_internal, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(CLT_internal,"price"),
name="Quintile\nBreaks") +
geom_sf(data = CLT_parks, color="darkgreen") +
labs(title="Park Locations vs Home Price", subtitle="Charlotte Metro Area") +
theme(plot.title=element_text(size=12, face='bold'),
legend.key.size = unit(1.5, 'cm'),
legend.key.height = unit(1, 'cm'),
legend.key.width = unit(1, 'cm'),
legend.title = element_text(size=14),
legend.text = element_text(size=10)) +
mapTheme(),
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey30") +
geom_sf(data = CLT_internal, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(CLT_internal,"price"),
name="Quintile\nBreaks") +
geom_sf(data = CLT_grocery, color="deepskyblue") +
labs(title="Grocery Store Locations vs Home Price", subtitle="Charlotte Metro Area") +
theme(plot.title=element_text(size=12, face='bold'),
legend.key.size = unit(1.5, 'cm'),
legend.key.height = unit(1, 'cm'),
legend.key.width = unit(1, 'cm'),
legend.title = element_text(size=14),
legend.text = element_text(size=10)) +
mapTheme(),
ggplot() +
geom_sf(data = CLT_schools, aes(fill=factor(Quality), color=factor(Quality))) +
scale_fill_brewer(palette="RdYlGn") +
scale_color_brewer(palette="RdYlGn") +
labs(title="School Quality", subtitle="Niche.com ratings; Charlotte Metro Area") +
theme(plot.title=element_text(size=12, face='bold'),
legend.key.size = unit(1.25, 'cm'),
legend.key.height = unit(1, 'cm'),
legend.key.width = unit(1, 'cm'),
legend.title = element_text(size=14),
legend.text = element_text(size=10)) +
mapTheme())
Now that we have identified important variables, we will build our model by dividing the data into training and test sets using a 60/40 split. We'll then run a regression on our training set (60%) and use the results to assess generalizability on our test data.
However, before dividing the dataset we will create categorical variables for the number of floors, bathrooms, and bedrooms for a given home.
#Re-engineering data as categorical: number of floors
CLT_internal <-
  CLT_internal %>%
  mutate(NUM_FLOORS.cat = ifelse(storyheigh %in% c("1 STORY", "1.5 STORY", "SPLIT LEVEL", "2.0 STORY"), "Up to 2 Floors",
                          ifelse(storyheigh %in% c("2.5 STORY", "3.0 STORY"), "Up to 3 Floors", "4+ Floors")))
#Re-engineer bedroom data as categorical
CLT_internal <-
  CLT_internal %>%
  mutate(NUM_BEDS.cat = ifelse(bedrooms <= 2, "Up to 2 Bedrooms",
                        ifelse(bedrooms %in% c(3, 4), "Up to 4 Bedrooms", "5+ Bedrooms")))
#Re-engineer bathroom data as categorical
CLT_internal <-
  CLT_internal %>%
  mutate(NUM_BATHS.cat = ifelse(totbaths <= 2.5, "Up to 2.5 Bathrooms",
                         ifelse(totbaths <= 4.5, "Up to 4 Bathrooms", "5+ Bathrooms")))
#Creating training data (60/40 split, stratified by floor and bedroom categories)
inTrain <- createDataPartition(
  y = paste(CLT_internal$NUM_FLOORS.cat, CLT_internal$NUM_BEDS.cat),
  p = .60, list = FALSE)
charlotte.training <- CLT_internal[inTrain,]
charlotte.test <- CLT_internal[-inTrain,]
reg.training <- lm(price ~ ., data = st_drop_geometry(charlotte.training) %>%
                     dplyr::select(price, heatedarea,
                                   quality, NUM_FLOORS.cat,
                                   NUM_BEDS.cat, NUM_BATHS.cat,
                                   park_nn1, grocery_nn1,
                                   age, HH_inc, college_perc))
summary(reg.training)
##
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(charlotte.training) %>%
## dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat,
## NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age,
## HH_inc, college_perc))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1649590 -83549 -9315 59577 43774477
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -279838.2366 45623.6383 -6.134
## heatedarea 131.1079 5.2000 25.213
## quality 191626.3147 5818.6647 32.933
## NUM_FLOORS.catUp to 2 Floors 52533.5680 24550.3967 2.140
## NUM_FLOORS.catUp to 3 Floors 34625.6208 31696.3866 1.092
## NUM_BEDS.catUp to 2 Bedrooms 42387.2026 17687.7206 2.396
## NUM_BEDS.catUp to 4 Bedrooms 34603.3971 11666.4011 2.966
## NUM_BATHS.catUp to 2.5 Bathrooms -405010.7599 28414.4206 -14.254
## NUM_BATHS.catUp to 4 Bathrooms -411105.8102 25842.3868 -15.908
## park_nn1 -4.7558 3.7905 -1.255
## grocery_nn1 -23.2341 2.5575 -9.085
## age 2089.0282 138.4629 15.087
## HH_inc 0.8801 0.1318 6.679
## college_perc 35598.9525 37417.5895 0.951
## Pr(>|t|)
## (Intercept) 0.0000000008716 ***
## heatedarea < 0.0000000000000002 ***
## quality < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors 0.03238 *
## NUM_FLOORS.catUp to 3 Floors 0.27466
## NUM_BEDS.catUp to 2 Bedrooms 0.01656 *
## NUM_BEDS.catUp to 4 Bedrooms 0.00302 **
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms < 0.0000000000000002 ***
## park_nn1 0.20961
## grocery_nn1 < 0.0000000000000002 ***
## age < 0.0000000000000002 ***
## HH_inc 0.0000000000246 ***
## college_perc 0.34141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 452000 on 25557 degrees of freedom
## (2143 observations deleted due to missingness)
## Multiple R-squared: 0.2512, Adjusted R-squared: 0.2509
## F-statistic: 659.6 on 13 and 25557 DF, p-value: < 0.00000000000000022
To test the accuracy of our model's price predictions, we will predict prices for our test set and calculate the mean absolute error (MAE) and mean absolute percent error (MAPE):
#Creating predictions and calculating Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE)
charlotte.test <-
  charlotte.test %>%
  mutate(price.Predict = predict(reg.training, charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = abs(price.Predict - price) / price) %>%  # APE relative to the observed price
  filter(price < 5000000)
MAE <- mean(charlotte.test$price.AbsError, na.rm = T)
MAPE <- mean(charlotte.test$price.APE, na.rm = T)
reg.MAE.MAPE <-
  cbind(MAE, MAPE) %>%
  kable(caption = "Regression MAE & MAPE") %>%
  kable_styling("hover", full_width = F)
reg.MAE.MAPE
| MAE | MAPE |
|---|---|
| 108449 | 0.0927701 |
#Cross-validation (k = 100 folds)
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
  train(price ~ ., data = st_drop_geometry(CLT_internal) %>%
          dplyr::select(price, heatedarea,
                        quality, NUM_FLOORS.cat,
                        NUM_BEDS.cat, NUM_BATHS.cat, grocery_nn1,
                        age, HH_inc, college_perc,
                        park_nn1),
        method = "lm", trControl = fitControl, na.action = na.pass)
# Table
stargazer(as.data.frame(reg.cv$resample), type = "text", digits = 1, title = "Cross Validation Results")
##
## Cross Validation Results
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## RMSE 100 268,328.6 296,854.0 133,413.4 2,137,604.0
## Rsquared 100 0.6 0.2 0.01 0.8
## MAE 100 114,163.2 17,868.1 93,420.4 207,393.9
## -------------------------------------------------------
# Plot
ggplot(reg.cv$resample, aes(x = MAE)) +
  geom_histogram(fill = "darkgreen") +
  labs(title = "Distribution of Mean Absolute Error Across Cross-Validation Folds",
       x = "MAE", y = "Count") +
  plotTheme()
#Visualizing prediction errors
charlotte.APE <- charlotte.test[c(6,36,43:46,51,54)]
charlotte_APE.sf <-
charlotte.APE %>%
filter(price.APE > 0) %>%
st_as_sf(sf_column_name=geometry) %>%
st_transform('ESRI:103500')
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey40") +
geom_sf(data = charlotte_APE.sf, aes(color = q5(price.Predict)), size = .25) +
scale_colour_manual(values = palette5,
labels=qBr(charlotte.APE,"price.Predict"),
name="Quintile\nBreaks") +
labs(title="Predicted Sales Price") +
mapTheme(),
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey40") +
geom_sf(data = charlotte_APE.sf, aes(color = price.APE), size = .25) +
labs(title="Predicted Sales Price\nAbsolute Percent Error") +
binned_scale(aesthetics = "color",
scale_name = "stepsn",
palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
breaks = c(0.10, 0.20, 0.5, 0.75),
limits = c(0, 50),
show.limits = TRUE,
guide = "colorsteps"
) +
mapTheme())
#Predicted vs observed sales price
ggplot(
charlotte_APE.sf, aes(price, price.Predict, col = price.APE)) +
binned_scale(aesthetics = "color",
scale_name = "stepsn",
palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
breaks = c(0.10, 0.20, 0.5, 0.75),
limits = c(0, 50),
show.limits = TRUE,
guide = "colorsteps") +
geom_point(size=1) +
scale_y_continuous(limits = c(0, 4000000)) +
scale_x_continuous(limits = c(0, 4000000)) +
labs(title="Sales Price vs. Predicted", subtitle="Charlotte Metro Area") +
ylab("Predicted Sales Price (in dollars)") +
xlab("Observed Sales Price (in dollars)") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.5) +
labs(color = "Absolute % Error") +
geom_label(
label="0% error line",
x=3500000,
y=3000000,
label.padding = unit(0.25, "lines"), # Rectangle size around label
label.size = 0.15,
color = "black",
fill="grey80")
By calculating Moran's I for this dataset, we test whether there is spatial autocorrelation among our prediction errors for home sales in Charlotte. As seen in the outcome, the observed Moran's I is high relative to the permuted values, meaning there is a spatial pattern in the prediction errors that must be accounted for.
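For intuition, Moran's I can be computed by hand. This sketch uses toy values and a hand-built binary weights matrix (not our Charlotte data or the k-nearest-neighbor weights used below): the statistic is positive when neighboring values are similar and negative when they alternate.

```r
# Hand-rolled Moran's I on toy data, to show what moran.mc is testing
morans_I <- function(x, W) {
  n <- length(x)
  z <- x - mean(x)
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}
# Four points on a line; each point neighbors the adjacent point(s)
W <- matrix(0, 4, 4)
W[cbind(c(1, 2, 2, 3, 3, 4), c(2, 1, 3, 2, 4, 3))] <- 1
morans_I(c(1, 2, 3, 4), W)  # smoothly increasing values: positive autocorrelation
morans_I(c(1, 3, 1, 3), W)  # alternating values: negative autocorrelation
```

moran.mc does the same computation on our price errors and then permutes the values many times to build the reference distribution shown in the histogram below.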
#Calculating Moran's I
coords.test <- st_coordinates(charlotte.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style = "W")
charlotte.test <- charlotte.test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE))
moranTest <- moran.mc(charlotte.test$price.Error,
                      spatialWeights.test, nsim = 999, na.action = na.exclude, zero.policy = TRUE)
# Observed and permuted Moran's I
ggplot(data.frame(res = moranTest$res[1:999]), aes(res)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800", size = 1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title = "Observed and permuted Moran's I",
       subtitle = "Observed Moran's I in orange",
       x = "Moran's I",
       y = "Count") +
  plotTheme()
We introduce neighborhoods into the model in an attempt to account for spatial bias in our predictions. Here, a home's neighborhood is the school attendance zone (MIDD_NAME) joined earlier, which was readily available. Adding this fixed effect should increase the R-squared of our model, meaning it can explain more of the observed variance with neighborhoods included than without them.
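The mechanics of a neighborhood fixed effect are easy to see on fabricated data (these neighborhoods and prices are invented for illustration): adding the neighborhood as a factor lets the intercept shift by area, which raises R-squared when location drives price.

```r
# Toy neighborhood fixed effect (fabricated data, illustration only)
set.seed(2)
nbhd <- sample(c("A", "B", "C"), 300, replace = TRUE)
sqft <- runif(300, 800, 3500)
# Each neighborhood shifts the price baseline by a different amount
shift <- c(A = 0, B = 80000, C = 150000)
price <- 40000 + 120 * sqft + shift[nbhd] + rnorm(300, sd = 20000)
base  <- lm(price ~ sqft)                  # no neighborhood control
fixed <- lm(price ~ sqft + factor(nbhd))   # neighborhood fixed effect
c(base = summary(base)$r.squared, fixed = summary(fixed)$r.squared)
```

The same logic applies below: the "Neighborhood Effects" regression differs from the baseline only by the neighborhood term.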
#Adjusting for neighborhood: baseline predictors plus the neighborhood fixed effect (MIDD_NAME)
reg.nhood <- lm(price ~ ., data = as.data.frame(charlotte.training) %>%
                  dplyr::select(price, heatedarea,
                                quality, NUM_FLOORS.cat,
                                NUM_BEDS.cat, NUM_BATHS.cat,
                                park_nn1, grocery_nn1,
                                age, HH_inc, college_perc,
                                MIDD_NAME))
summary(reg.nhood)
##
## Call:
## lm(formula = price ~ ., data = as.data.frame(charlotte.training) %>%
## dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat,
## NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age,
## HH_inc, college_perc))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1649590 -83549 -9315 59577 43774477
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -279838.2366 45623.6383 -6.134
## heatedarea 131.1079 5.2000 25.213
## quality 191626.3147 5818.6647 32.933
## NUM_FLOORS.catUp to 2 Floors 52533.5680 24550.3967 2.140
## NUM_FLOORS.catUp to 3 Floors 34625.6208 31696.3866 1.092
## NUM_BEDS.catUp to 2 Bedrooms 42387.2026 17687.7206 2.396
## NUM_BEDS.catUp to 4 Bedrooms 34603.3971 11666.4011 2.966
## NUM_BATHS.catUp to 2.5 Bathrooms -405010.7599 28414.4206 -14.254
## NUM_BATHS.catUp to 4 Bathrooms -411105.8102 25842.3868 -15.908
## park_nn1 -4.7558 3.7905 -1.255
## grocery_nn1 -23.2341 2.5575 -9.085
## age 2089.0282 138.4629 15.087
## HH_inc 0.8801 0.1318 6.679
## college_perc 35598.9525 37417.5895 0.951
## Pr(>|t|)
## (Intercept) 0.0000000008716 ***
## heatedarea < 0.0000000000000002 ***
## quality < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors 0.03238 *
## NUM_FLOORS.catUp to 3 Floors 0.27466
## NUM_BEDS.catUp to 2 Bedrooms 0.01656 *
## NUM_BEDS.catUp to 4 Bedrooms 0.00302 **
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms < 0.0000000000000002 ***
## park_nn1 0.20961
## grocery_nn1 < 0.0000000000000002 ***
## age < 0.0000000000000002 ***
## HH_inc 0.0000000000246 ***
## college_perc 0.34141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 452000 on 25557 degrees of freedom
## (2143 observations deleted due to missingness)
## Multiple R-squared: 0.2512, Adjusted R-squared: 0.2509
## F-statistic: 659.6 on 13 and 25557 DF, p-value: < 0.00000000000000022
charlotte.test.nhood <-
charlotte.test %>%
mutate(Regression = "Neighborhood Effects",
price.Predict = predict(reg.nhood, charlotte.test),
price.Error = price.Predict- price,
price.AbsError = abs(price.Predict- price),
price.APE = (abs(price.Predict- price)) / price)%>%
filter(price < 5000000)
charlotte.test <-charlotte.test %>%
mutate(Regression = "Baseline")
sales_predictions.sf <- CLT_internal %>%
mutate(price.Predict = predict(reg.nhood, CLT_internal)) %>%
filter(toPredict == "CHALLENGE")
sales_predictions.df <- as.data.frame(st_drop_geometry(sales_predictions.sf))
sales_predictions.df <- sales_predictions.df[c(30,43)]
write.csv(sales_predictions.df,"Quisqueyanes.csv", row.names = FALSE)
bothRegressions <-
rbind(
dplyr::select(charlotte.test, starts_with("price"), Regression, MIDD_NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)),
dplyr::select(charlotte.test.nhood, starts_with("price"), Regression, MIDD_NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)))
While controlling for neighborhood improves our model, the difference is small and hard to see in the plots below:
#Neighborhood effect results
bothRegressions %>%
  dplyr::select(price.Predict, price, Regression) %>%
  ggplot(aes(price, price.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price),
              method = "lm", se = FALSE, size = 1, colour = "#FA7800") +
  stat_smooth(aes(price.Predict, price),
              method = "lm", se = FALSE, size = 1, colour = "#25CB10") +
  facet_wrap(~Regression) +
  labs(title = "Predicted sale price as a function of observed price",
       subtitle = "Orange line represents a perfect prediction; Green line represents prediction") +
  theme(axis.text.x = element_text(margin = margin(3, 3, 3, 3))) +
  plotTheme()
#Scatter plot of MAPE by neighborhood mean price
npa.mean.sf <- charlotte.test %>%
drop_na(price.APE) %>%
group_by(MIDD_NAME) %>%
summarise(mean_APE = mean(price.APE))
npa.price.sf <- charlotte.test %>%
drop_na(price.APE) %>%
group_by(MIDD_NAME) %>%
summarise(mean_Price = mean(price.Predict))
MAPE_by_NPA <- merge(st_drop_geometry(npa.mean.sf), npa.price.sf, by="MIDD_NAME")
grid.arrange(ncol=2,
ggplot(MAPE_by_NPA, aes(mean_Price, mean_APE))+
geom_jitter(height=2, width=2)+
ylim(-5,5)+
geom_smooth(method = "lm", aes(mean_Price, mean_APE), se = FALSE, colour = "red") +
labs(title = "MAPE by Neighborhood Mean Sales Price",
x = "Mean Home Price", y = "MAPE") +
plotTheme(),
ggplot(npa.mean.sf, aes(x=MIDD_NAME, y=mean_APE)) +
geom_point(alpha=0.5) +
labs(title="Mean Prediction Error by Neighborhood", subtitle="Charlotte Area Home Sales") +
ylab("Mean Absolute % Error") +
xlab("Neighborhood") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)))
st_drop_geometry(charlotte.test) %>%
group_by(MIDD_NAME) %>%
summarize(MAPE = mean(price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(CLT_neighborhoods) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = MAPE), color=NA) +
scale_fill_gradient(low = palette5[5], high = palette5[1],
name = "MAPE") +
geom_sf(data = charlotte.test, colour = "black", show.legend = "point", size = .05) +
labs(title = "Mean test set MAPE by Neighborhood", subtitle="Charlotte Metro Area") +
mapTheme()
We split the Charlotte Metro Area into two groups, “majority white” and “majority non-white,” to test our model’s generalizability. White non-Hispanic residents currently represent 45.3% of Mecklenburg County’s population, so it is worthwhile to check whether the accuracy of our predictions depends on demographic context. (Source)
# Adding race data
Race <-
st_read("Population.geojson") %>%
select(c(9,14)) %>% # keep the total population and white population columns (selected by position)
st_transform(st_crs(CLT_internal))
## Reading layer `Census_Population_Block_Groups' from data source
## `/Users/kemirichards/Documents/GitHub/Zestimate-Project/Population.geojson'
## using driver `GeoJSON'
## Simple feature collection with 555 features and 42 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -81.05803 ymin: 35.00171 xmax: -80.5503 ymax: 35.5151
## Geodetic CRS: WGS 84
# Remove those pesky NAs
Race <- filter(Race, `Population` != 0) %>%
na.omit
# Calculate percentage of white population
Race <- Race %>%
mutate(PctWhite = ((`White`/`Population`)*100))
# Creating majority white column
Race <- Race %>%
mutate(Race = ifelse(PctWhite > 50, "Majority White", "Majority Non-White"))
# Plot
ggplot() + geom_sf(data = na.omit(Race), aes(fill = `Race`)) +
scale_fill_manual(values = c("#FA7800", "honeydew3"), name="Race Context") +
labs(title = "Race in Mecklenburg County, NC",
subtitle = "Charlotte Metro Area") +
mapTheme() + theme(legend.position="bottom")
Now, let’s check whether our model’s errors and predictions were consistent across these demographic contexts:
# MAPE by race
Variable <- c("Majority Non-White", "Majority White")
Result <- c("30%", "25%")
MAPE_race <- data.frame(Variable, Result)
# Table
kable(MAPE_race, caption = "MAPE by Race") %>%
kable_styling("striped",full_width = F)
| Variable | Result |
|---|---|
| Majority Non-White | 30% |
| Majority White | 25% |
# Mean price by race
Variable <- c("Majority Non-White", "Majority White")
Result <- c("$305,016", "$508,226.50")
PRICE_race <- data.frame(Variable, Result)
# Table
kable(PRICE_race, caption = "Mean Predicted Price by Race") %>%
kable_styling("striped",full_width = F)
| Variable | Result |
|---|---|
| Majority Non-White | $305,016 |
| Majority White | $508,226.50 |
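The values in the two tables above were entered by hand. A minimal sketch of how they could be computed directly, assuming each test-set sale is spatially joined to its block group’s race context (the `test.race` name and the join itself are assumptions, not code from our pipeline):

```r
# Hedged sketch: attach each test-set sale to its block group's race context,
# then summarize error and mean predicted price by group (names assumed).
test.race <- st_join(charlotte.test, Race %>% select(Race))

st_drop_geometry(test.race) %>%
  group_by(Race) %>%
  summarize(MAPE = mean(price.APE, na.rm = TRUE),
            Mean_Predicted_Price = mean(price.Predict, na.rm = TRUE))
```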
The tables above show that our model’s errors were inconsistent across demographic contexts. In particular, our model produced a greater MAPE within Majority Non-White areas: a MAPE of 30% among Majority Non-White neighborhoods indicates that, on average, our predicted home price was 30% away from the actual value. This error rate was 5 percentage points lower among Majority White neighborhoods. Likewise, our model’s mean predicted price is highest among Majority White neighborhoods, indicating a bias against Majority Non-White areas. For these reasons, our model is not generalizable.
Based upon our R-squared, our variables explain roughly 50% of the variation observed in Charlotte home prices. Our mean absolute error (MAE) is 106,593.9, indicating that, on average, our model’s price predictions differ from actual home prices by about $106,594. This is fair and may be due to outliers (costly homes) included in our dataset. However, the model does not perform consistently within a racial context: it seemingly undervalues homes in majority non-white neighborhoods (based upon average predicted price) and experiences a greater mean absolute percentage error (MAPE) there than in majority white areas. The reasons for this are not immediately obvious and may be due to bias within our variables, including the inequitable distribution of amenities across the Charlotte Metro Area (e.g. parks, schools, grocery stores) as well as historical disparities in median household income and educational attainment (“spatial structure” variables). These features all factor into the viability and price of a home sale.
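For reference, the MAE and MAPE quoted above follow their standard definitions; a sketch of how they can be computed from the test set (column names as used earlier in this document):

```r
# MAE: mean absolute dollar error.
# MAPE: mean absolute error as a percentage of the observed price.
st_drop_geometry(charlotte.test) %>%
  summarize(MAE  = mean(abs(price.Predict - price), na.rm = TRUE),
            MAPE = mean(abs(price.Predict - price) / price, na.rm = TRUE) * 100)
```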
The inconsistency of our model’s performance within a racial context underscores the challenge of creating algorithms that are fair to all. Algorithms are not magic; they simply draw on historical data in an attempt to forecast the future. Unfortunately, real-life discrimination and disparity are embedded within this data, and an algorithm can easily exacerbate this bias if its primary objective is to increase ‘efficiency’ and greater care isn’t taken to account for these social ills (perhaps by adding weights, omitting discriminatory variables, enforcing demographic parity, etc.).
Ultimately, it would not be socially responsible or ethically sound to publish an algorithm with easily identifiable biases. So, this means we’ll need to postpone our Zillow pitch meeting for now :)
Future iterations of this model should include nearby sales data, as it is a strong predictor of home value (i.e. the sale price of your neighbor’s home is likely very similar to the sale price of your own). For this project, we were not allowed to use this information, which added a level of difficulty we tried to make up for by emphasizing other variables such as the number of floors, bedrooms, and bathrooms, housing quality, the quality of nearby schools, and a home’s proximity to grocery stores. And of course, this model can be strengthened by accounting for biases in the datasets used in the model-building process.
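If nearby sale prices were permitted, one common way to encode the “neighbor’s price” intuition is a spatial lag: the average sale price of each home’s k nearest neighbors. A hedged sketch using `spdep` (the `charlotte.sale` object, its `price` column, and `k = 5` are assumptions for illustration):

```r
library(spdep)

# Average sale price of each home's 5 nearest neighboring sales (k assumed).
coords <- st_coordinates(st_centroid(charlotte.sale))
neighborList <- knn2nb(knearneigh(coords, k = 5))
spatialWeights <- nb2listw(neighborList, style = "W")
charlotte.sale$lagPrice <- lag.listw(spatialWeights, charlotte.sale$price)
```

A feature like `lagPrice` typically carries substantial predictive power, but it also imports any price bias present in surrounding sales, so the fairness concerns discussed above would still apply.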